
[feat] Resume from ckpt#135

Open
kevssim wants to merge 30 commits into modelscope:main from
kevssim:resume_from_ckpt

Conversation

@kevssim
Collaborator

@kevssim kevssim commented Mar 31, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

Implement full training-state resumption in TransformersModel and MultiLoraModel, covering the optimizer, the scheduler, the RNG configuration, and dataset skipping.
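The core idea of a strict resume is that a checkpoint must capture more than model weights: optimizer and scheduler states, the RNG state, and the data position all need to round-trip. A torch-free sketch of that contract (in the actual PR these would map onto `state_dict()`/`load_state_dict()` and the torch/CUDA RNG states; the function names here are illustrative, not the PR's API):

```python
import random

def capture_training_state(step, optimizer_state, scheduler_state):
    # Everything needed for a strict resume: optimizer/scheduler states,
    # the RNG state, and how far into training we were.
    return {
        "step": step,
        "optimizer": optimizer_state,
        "scheduler": scheduler_state,
        "rng": random.getstate(),
    }

def restore_training_state(state):
    # Restoring the RNG state makes post-resume randomness (shuffling,
    # augmentation, dropout masks, ...) identical to the uninterrupted run.
    random.setstate(state["rng"])
    return state["step"]
```

The test of correctness is determinism: random draws after a restore must match the draws the uninterrupted run would have produced.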

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements a comprehensive "Strict Resume" feature for Transformers models, enabling the restoration of full training state including optimizer, scheduler, scaler, RNG states, and data progress. Key changes implement load_training_state and read_training_progress across the model, server, and client layers, alongside dataloader enhancements to support sample-level skipping for map-style datasets. Feedback highlights several critical improvements:

  • ensure deterministic RNG in distributed settings by avoiding unseeded random states
  • replace the deprecated use of StopIteration in generators
  • improve security by passing weights_only=True when loading checkpoints
  • remove an accidental BOM character in the client generator
  • re-initialize the dataloader more robustly instead of modifying private PyTorch attributes
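The StopIteration point is worth spelling out: since PEP 479 (mandatory from Python 3.7), a `StopIteration` raised inside a generator body is converted to a `RuntimeError` rather than silently ending iteration, so skip-style generators must `return` instead. A minimal illustration of the correct pattern (the function name is illustrative, not the PR's):

```python
def skip_samples(iterable, n):
    """Yield items of `iterable` after skipping the first `n`."""
    it = iter(iterable)
    for _ in range(n):
        try:
            next(it)
        except StopIteration:
            # Correct way to end a generator early; re-raising StopIteration
            # here would surface as a RuntimeError under PEP 479.
            return
    yield from it
```

The exhaustion case (skipping more samples than exist) now yields an empty iterator instead of crashing.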

Comment on lines +137 to +139
self.dataloader.__initialized = False
self._rebuild_sampler_stack()
self.dataloader.__initialized = True
Contributor


medium

Accessing and modifying the private attribute __initialized of torch.utils.data.DataLoader is brittle and relies on internal implementation details of PyTorch that could change. A safer approach to update the sampler stack after the dataloader has been created is to simply re-instantiate the underlying self.dataloader using the stored self.dataloader_params.

Suggested change
- self.dataloader.__initialized = False
- self._rebuild_sampler_stack()
- self.dataloader.__initialized = True
+ if self.dataloader is not None:
+     self.dataloader = None
+     self._lazy_init_dataloader()

@kevssim
Collaborator Author

kevssim commented Apr 1, 2026

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request introduces robust checkpoint resumption capabilities to the training framework. By enabling the restoration of full training states—including optimizer, scheduler, and RNG configurations—and implementing precise data skipping in the dataloader, the changes ensure that training can be reliably resumed after interruptions. Additionally, the PR optimizes checkpoint handling for FSDP2 strategies and adds necessary API endpoints to support these features in distributed and remote training environments.

Highlights

  • Checkpoint Resumption Support: Added comprehensive support for resuming training from checkpoints, including model weights, optimizer states, learning rate schedulers, and RNG states.
  • Dataloader Skipping: Implemented skip_consumed_samples in the dataloader to correctly resume data iteration from the exact point where training was interrupted.
  • FSDP2 Optimization: Enhanced FSDP2 strategy to support efficient saving and loading of wrapped optimizer states.
  • API Extensions: Exposed new server-side endpoints for loading training states and reading progress metadata to facilitate remote training resumption.
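The dataloader-skipping highlight is the subtle one for map-style datasets: the first post-resume epoch must start mid-epoch, while every subsequent epoch must iterate in full. One common pattern is a wrapper sampler that drops the first N indices exactly once (a hedged sketch; the PR's actual `skip_consumed_samples` implementation may differ):

```python
class SkipSampler:
    """Wrap an index sequence and skip the first `consumed` indices once."""

    def __init__(self, base_indices, consumed=0):
        self.base_indices = list(base_indices)
        self.consumed = consumed

    def __iter__(self):
        skipped, self.consumed = self.consumed, 0  # only skip on the resumed epoch
        return iter(self.base_indices[skipped:])

    def __len__(self):
        return len(self.base_indices)
```

In real use the base indices would come from the original (seeded) sampler, so the skipped prefix is exactly the set of samples already consumed before the interruption.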


Activity
  • Pull request created by kevssim.
  • Automated code review identified potential issues with non-deterministic random state generation, use of private attributes, and security concerns regarding torch.load.
  • Author implemented fixes addressing random state seeding, deprecated StopIteration usage, and improved checkpoint loading security.
  • Refactored sampler stack rebuilding to avoid brittle modifications of dataloader internals.

@kevssim kevssim marked this pull request as ready for review April 1, 2026 09:32
@kevssim kevssim changed the title Resume from ckpt [feat] Resume from ckpt Apr 1, 2026
